Abstract: The Test of English as a Foreign Language (TOEFL) has a tremendous influence on EFL (English
as a Foreign Language) learners worldwide. The TOEFL 2000 project claims that the TOEFL, as a test more
reflective of communicative competence models, can provide more information about the language ability of
international students that it is supposed to measure. However, a detailed analysis of an authentic paper-based
test administered in China in May 2001, examined from four aspects (test reliability, construct validity,
authenticity and interactiveness), finds that the test puts too much emphasis on vocabulary and grammar
knowledge in almost every section of the paper, of which “structure and written expression” may be the most
disputed part. The content does not fully demonstrate validity or communicative purpose, so it is doubtful
whether test takers can meet the later demands of academic study abroad. Nevertheless, this offers a powerful
explanation for the current revolutionary change in the framework and content of the TOEFL to meet the
principles of test design, which could provide more information and guidance for later test designs.
During the past couple of years, the TOEFL (Test of English as a Foreign Language) has undergone a
revolutionary change in test content and framework. What prompted the change? What are the changes? What are
the implications of the changes? Answering these three W-questions could give us a guideline for making language
tests much more reliable, valid, authentic and interactive, in accordance with communicative language teaching
worldwide.

1. Background knowledge about the TOEFL 

The eagerness to learn a foreign language promotes the development of foreign language learning. As a
means of proving one’s language proficiency, the TOEFL, one form of international language test, has become the
dominant type worldwide. It is slightly different from classroom tests. It has no fixed content that has been
taught to test takers, which accounts for its wide range and general content aimed at EFL learners worldwide. It is
a proficiency test rather than an achievement test, since it measures someone’s language abilities at a certain time.
The TOEFL is a norm-referenced test rather than a criterion-referenced one, since test results are interpreted with
reference to the performance of a certain group, whose performance is used to relate one candidate’s performance
to that of other candidates (Hughes, 1989, pp. 17-18), that is, to obtain meaning from the referenced scores (Ebel
& Frisbie, 1991, p. 34).

The TOEFL 2000 project claims that the TOEFL is “more reflective of communicative competence models” and
that it “provides more information than current TOEFL scores do about international students’ ability to use
English in an academic environment” (Jamieson, et al., 2000, p. 3). Before the birth of the TOEFL 2000 project,
some researchers categorized the TOEFL as a non-communicative test. But does the project really bring about a
revolutionary change? As Morrow (1986, p. 9) mentions, in communicative testing “What we are concerned with is
the performance of an individual performing a set of tasks in a foreign language”. Can the TOEFL really attain its
ambitious goals?

According to the TOEFL 2000 project, the traditional TOEFL test examines one’s language competence in
listening, reading and writing skills, into which vocabulary and structure knowledge have been integrated for
years. Moreover, there are standard procedures for administering and scoring the test, and the TOEFL is
administered systematically on a fixed basis worldwide. The total paper-based test score is now reported on a
scale that ranges from 310 to 677, while the TWE (Test of Written English) score is reported separately on a scale
of 1 to 6. Finally, through a process of empirical research and development, the characteristics of the test are well
known, and testees even have suggestions and tips for preparing for the TOEFL, which are provided by
Educational Testing Service (ETS). In the survey by Brown and Ross (1996, p. 233) of 20,000 randomly selected
testees, approximately 85.2% used the TOEFL score for graduate or undergraduate studies or another type of
school, 13.8% for a license or a company, and only 1% gave no reason for taking the TOEFL. Evidently, more and
more people use the TOEFL score as proof of individual language proficiency to meet the later requirements of
both academic degree programmes and ESL learning, even though there is no standard criterion to define which
score is a “pass” and which is a “failure”.

As a large-scale proficiency test, the TOEFL is designed to measure people’s language abilities. However, it is
not a test to discover whether someone has adequate command of the language for a particular purpose but rather
one with a more general concept. It is commonly accepted that the TOEFL has thrived for a long period in meeting
the global requirements of EFL testing, owing either to its rationality or to its exclusiveness, but it is certainly
facing new challenges from other test systems as time goes by. For instance, more and more countries, especially
European countries, have adopted the International English Language Testing System (IELTS) as a main
assessment of English proficiency. It is not national preference that drives this tendency; rather, the tendency
undoubtedly reflects the basic considerations and appeal that come from the test principles.

2. Study on the TOEFL paper 2001 

In order to better understand some of the revolutionary changes to the TOEFL in recent years, it is
sensible to review its tests in light of the TOEFL 2000 project.

2.1 Test framework 

Take as an example one TOEFL paper-based test, administered in China in May 2001. The overall
structure of the test paper consists mainly of four parts:

(1) Section 1: Test of Written English (TWE) (30 minutes); 

(2) Section 2: Listening Comprehension (30 minutes); 

(3) Section 3: Structure and Written Expression (25 minutes); 

(4) Section 4: Reading Comprehension (45 minutes). 

Among the sections, sections 2, 3 and 4 are timed tests in multiple-choice format with four options for each
question. The TOEFL, as a popular norm-referenced test for the whole world, is designed not around particular
content or a language course but to meet the fundamental and necessary requirement of using language: to
communicate. Since this is the foremost criterion, we have to be careful about the design and reconsider the
function of the tests. To measure language proficiency in almost every kind of situation, we need to take
account of when, where, how, why and what is to be used. Therefore, how to make the tests as representative as
possible is the key issue in designing language tests. Bachman and Palmer (1996, pp. 19-25) provide some
basic criteria which need to be reflected in the test paper: test reliability, construct validity, authenticity
and interactiveness.

2.2 Test reliability 

The concept of reliability is particularly important in language tests. Although we can never have
complete trust in any set of scores, we try to produce a consistent test score which is as free as possible from
measurement error, caused mainly by different testing times, test forms, raters and other characteristics of the
measurement context; that is, reliability concerns the consistency of test judgements and results (Bachman, 1990;
Hughes, 1989; Weir, 1990; Davies, 1990). A highly reliable score ought to be “accurate, reproducible and
generalizable to other testing occasions and other similar test instruments” (Ebel & Frisbie, 1991, p. 76). In the
TOEFL, there are two components of test reliability to consider: one is the performance of testees and the
other is the reliability of the scoring. Let us look at the data provided by ETS over time. In China, 31,462
students took the TOEFL CBT between July 1999 and June 2000; their average scores in the three parts (listening,
structure and reading) were 20, 21 and 21 respectively, and the mean total score was 206. Between July 2001 and
June 2002, 58,772 students took the TOEFL CBT; they scored 20, 21 and 21 in the three parts, and the mean total
score was 207 (TOEFL test score and data 2000-2001, 2002-2003). From these data, we find that the scores of
Chinese students generally cluster around the 20 level and that the reliability estimates were substantial and well
within the desirable range. Part of the reason is that the TWE mark is not added to the total score, so the other
three sections require no judgement in scoring given the test format and could in practice be scored by a
computer; thus, the main part of the TOEFL is said to be objective and highly reliable.

2.3 Test validity 

It seems to be axiomatic that “validity cannot be established unless reliability is also established for specific
contexts of language performance” (Cumming & Mellow, 1995, p. 77). “A test, part of a test, or a testing
technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is
supposed to measure” (Hughes, 1989, p. 26). If test scores are affected by abilities other than the one we
want to measure, they will not be a satisfactory interpretation of the particular ability. In this TOEFL test paper
of May 2001, if we look at each section rather than the overall structure, reading comprehension does not cause
much concern, since it fairly clearly demonstrates that it measures a distinct ability. There are five passages,
on social science, biology, literature, ethology and geology, which cover a wide variety of topics.
Including the fifty questions, the whole reading comprehension section has 3,673 words, which means that testees
need to read at about 82 words per minute. This is a high demand for EFL learners, who need to prove
their abilities in language knowledge as well as cultural background knowledge. What is more, the reading part
asks not only about stated information but also about implied meaning and even the specific meaning of a
certain word. These aspects require the skills of reading both extensively and intensively. If “the purpose,
events, skills, functions, levels are carried out as what they are expected to” (Carroll, 1980, p. 67), construct
validity is fully displayed in the TOEFL reading part.
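
The reading-rate figure above can be checked with a quick calculation. The sketch below is illustrative only: it
assumes that the full 45 minutes of Section 4 is spent reading the 3,673 words (passages plus questions), with no
time set aside for choosing answers, so the real demand on testees is, if anything, higher. The variable names are
ours and not part of the original analysis.

```python
# Rough check of the required reading rate for Section 4 (Reading Comprehension).
# Assumption: all 45 minutes are available for reading the 3,673 words.
total_words = 3673   # passages plus the fifty questions (figure from the text)
time_minutes = 45    # time allowed for Section 4 (figure from the text)

words_per_minute = total_words / time_minutes
print(f"Required reading rate: about {words_per_minute:.0f} words per minute")
```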

2.4 Test authenticity and interactiveness 

The other two principles we need to consider are authenticity and interactiveness. “Authenticity provides a
means for investigating the extent to which score interpretations generalize beyond performance on the test to
language use” (Bachman & Palmer, 1996, pp. 23-24); that is, the tasks the test sets should correspond to language
use beyond the test. In language tests, authenticity is sometimes only distantly related to real communicative
tasks, since tests carry out a series of linguistic skills rather than genuine operational ones for the sake of
reliability and economy (Carroll, 1980, p. 37). The listening comprehension section of the TOEFL simulates the
speaking environment of North American colleges and universities and adds some idiomatic expressions common in
spoken English to capture the features of target language use, so we could say that this section provides
authentic materials to a certain extent. Nevertheless, if we test only listening or reading, language proficiency is
not fully engaged and we can never form a general idea of the testees’ language level, so the test could
not be called successful at all.

Interactiveness refers to the extent and type of involvement of the test taker’s individual characteristics in
accomplishing a test task (Bachman & Palmer, 1996, p. 25). Given differences in areas of language knowledge,
planning strategy and personality, how to give each testee a fair chance is always a question. The TOEFL
addresses this point by offering a general topic in writing, by using standard written English in grammar
structure, and by covering various topics in reading. However, we can still find elements which are too
“Americanized”. For instance, the listening comprehension is delivered in American pronunciation, which
seems to be hard work for learners worldwide whose first language is not English.

In contrast to the TOEFL listening section, the Cambridge First Certificate in English (FCE) provides a variety
of accents, including both standard native-speaker variants of English and non-native-speaker accents
(Cambridge FCE Handbook, 1997). This design in the FCE recreates an environment similar to that of
English-speaking countries and makes the whole test more communicative and practical. In the TOEFL, although the
reading comprehension seems to cover abundant topics, many passages concern American topics and fairly few
concern non-American ones. As Hilke and Wadden (1997, p. 36) note, “what certain TOEFL texts choose to include,
moreover, is often as significant as what they fail to include”. In this test paper, two fifths of the reading
content is closely tied to American historical background. Thus, whether the TOEFL provides each candidate with a
fair chance is not clearly demonstrated.

3. Analyzing the “language” knowledge in the TOEFL 

In the framework of language structure put forward by Bachman and Palmer (1996, pp. 68-75), we can
infer that learners’ language ability consists of two parts: language knowledge and strategic competence
(metacognitive strategies). That is to say, learners need to know the vocabulary, grammar and sound system,
as well as to use coherent sentences in a certain language setting, to achieve the communicative goals of
language users. The TOEFL, as a way to demonstrate candidates’ achievement in English, should determine
whether they can apply the knowledge and skills in their future real-life study, that is, assess their
performance in the language. This is the main reason for constructing tests to obtain information: “How well
individuals perform on the test represents to some degree how they might be expected to respond outside the
testing environment” (Sax, 1997, p. 304). However, we do not expect a test to measure all aspects of
language in each section; thus, the samples should be as representative as possible. Here, more emphasis will
be put on the grammatical knowledge part of the TOEFL.

3.1 Testing grammatical knowledge in writing, listening and reading skills 

Grammatical knowledge mainly includes three parts: vocabulary, syntax and phonology (Bachman & Palmer,
1996, p. 70). In this TOEFL test paper, knowledge of vocabulary seems to be tested in all sections, which
confirms the common view that words are the basic building blocks of language. Vocabulary, which is embedded,
comprehensive and context-dependent in nature, plays an explicit role in the assessment of learners’ performance
(Read & Chapelle, 2001). The best way to test people’s vocabulary is to use various methods to test the basic
meaning of a word, its derived forms, its collocations or its meaning relationships in context. Nation (1990)
gives a systematic list of competencies which have come to be known as types of word knowledge: (1)
spoken form of the word; (2) written form of the word; (3) grammatical behaviour of the word; (4) collocational
behaviour of the word; (5) frequency of the word; (6) stylistic register constraints of the word; (7) conceptual
meaning of the word; (8) associations the word has with other related words (Schmitt, 1999, p. 194). These types
of word knowledge define what it means to know a word; thus, if we want to analyze the construct validity of
vocabulary items in the TOEFL, the key element is whether the sense tested is a typical usage in a future
academic context. Schmitt (1999, p. 192) also points out: “Although any individual vocabulary item is
likely to have internal content validity, there are broader issues involving the representativeness of the target
words chosen”.

The TWE checks not only the written form of words but also their function and collocations in
grammatical usage. Cumming and Mellow (1995, p. 77) define a general ESL composition profile, which comprises
“vocabulary (range, choice, usage, word form mastery, register), language use (complex constructions, errors of
agreement, tense, number, word order/function, articles, pronouns, prepositions) and mechanics (spelling,
punctuation, capitalization, paragraphing)”. Testees need to finish a composition in 30 minutes, and a length of
more than 300 words is preferable. However, a limitation of the TWE is its narrow range of writing styles.
As with the topic in this test paper, most TOEFL writing tasks call for contrastive writing to show
personal preference or choice. Although the writing section is not specifically designed to test grammatical
knowledge, whether the sample chosen in the TOEFL is truly representative of communicative
competence is still a question.

In listening comprehension, testing vocabulary is no longer limited to single words. There are many
compound words, phrases and even idiomatic expressions and slang. For example, in the May 2001 test paper, the
dialogues between two speakers contain idioms such as “have something checked out, headed one’s way, big
snow storm, get a little carried away, that sure beats sticking around here”, etc. Since most of the dialogues are
drawn from American daily life, many phrases and sentences cause great difficulty for EFL testees, because it is
difficult to work out the meaning from the surface meaning of the words. Moreover, both the conversations and the
answer choices place high demands on grammar, requiring testees to give a definite response within fifteen
seconds. For example, the four choices in No. 8 display four different tenses: the present simple, the past
perfect, the subjunctive mood in a future sense and the future tense. The dialogue in No. 8 is:

M: My back has been aching ever since I started playing tennis on the weekends. 

W: Haven't you had that checked out yet? 

Q: What does the woman imply? 

From this short dialogue, we notice that the first speaker usually presents the content or background of the
conversation and the second speaker gives the hint to the answer to the question. Summing up the questions from
the first thirty short dialogues, we get the following results (Table 1):
Table 1 Questions of 30 dialogues in the TOEFL test, May 2001

Typical questions                                Percentage
What does the man/woman imply?                   37%
What does the man/woman mean?                    33%
What does the man/woman suggest?                 13%
What can be inferred from the conversation?      10%
Others                                           7%

From the types of questions, it is not difficult to see that answering these questions in “listening
comprehension” requires both fluency and consolidated grammatical knowledge. The listening comprehension test is
thus very much a combined test of vocabulary and syntax.
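
The percentages in Table 1 are consistent with whole-number counts out of the 30 dialogues. The sketch below is a
hedged illustration rather than a report of the data: the counts (11, 10, 4, 3 and 2) are inferred by us from the
rounded percentages and are not given in the source, and all names in the code are hypothetical.

```python
# Hypothetical counts inferred from the rounded percentages in Table 1.
# The source reports only percentages, not raw counts.
TOTAL_DIALOGUES = 30

inferred_counts = {
    "What does the man/woman imply?": 11,
    "What does the man/woman mean?": 10,
    "What does the man/woman suggest?": 4,
    "What can be inferred from the conversation?": 3,
    "Others": 2,
}

# Convert each count to a whole-number percentage of the 30 dialogues.
percentages = {
    question: round(100 * count / TOTAL_DIALOGUES)
    for question, count in inferred_counts.items()
}

for question, pct in percentages.items():
    print(f"{question:<45} {pct:>3}%")
```

Under these assumed counts, the rounded shares reproduce the figures in Table 1.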

The communicative philosophy of a reading test is to test “in what situations do we read which texts for which
purposes” (Wijgh, 1995, p. 155). Originally, TOEFL tests had vocabulary items, which were selective,
context-independent multiple-choice items presenting words in isolation. They were criticized because international
students simply spent time unproductively memorizing long lists of words together with synonyms or definitions
(Read & Chapelle, 2001, p. 14). Yet the prominent feature of the vocabulary items still exists in the TOEFL
reading comprehension subtest, namely the testing of the meaning of words or short phrases. Banerjee and
Clapham (2003, p. 116) point out that although the section of the TOEFL previously called reading and
vocabulary has been renamed reading comprehension, it still consists of two distinct tests: reading and
vocabulary. In this test paper, there are 20 questions on the closest meaning or the referent of
words or phrases, 16 of which are questions about words. These questions make up two fifths of all the
reading questions, and the second passage has the largest number of them: five out of ten. These
questions always appear in a few fixed forms: “The word ‘lured’ in line 19 is closest in meaning to...; the
word ‘them’ in line 11 refers to...”. Although the “closest in meaning to” questions concern much of the word
meaning in the context, the rest of the word questions seem to assess the range of candidates’ vocabulary. And
sometimes, without referring back to the passage, testees can still get the answer if they simply know the
meaning of the words. As Read (1997, 2000, cited by Read & Chapelle, 2001) has said, the vocabulary items
in the reading test of the TOEFL can be categorized as relatively independent of context, despite the manner in
which they are presented. Is this yet another section which focuses on vocabulary?

3.2 Testing grammar knowledge independently 

Assessing language knowledge is usually reflected in the “four basic skills”: speaking, listening, reading and
writing. Although some well-known proficiency tests have removed the grammar component (Hughes, 1989, p. 141),
“structure and written expression” still remains part of the TOEFL, and its content is similar to
the “use of English” section in the First Certificate in English (FCE), Cambridge Level Three. Wall, et al. (1991,
p. 214) suggest that to determine content validity, several things need to be established:
whether the tasks being tested are the ones intended to be tested; whether the sampling of tasks is adequate; and
whether the level of difficulty of the components is appropriate. The principles of communicative language learning
guiding test construction in “structure and written expression” suggest that testees should know how to use
different structures and useful expressions in language output in order to be effective and efficient speakers and
writers, which would satisfy the original purpose of studying in North America. In FCE, testees are expected to